Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade Resiliency: Fixes Code to Clean-up Unhealthy Connection and LbChannelState Object. #4928

Conversation

kundadebdatta
Copy link
Member

@kundadebdatta kundadebdatta commented Dec 10, 2024

Pull Request Template

Description

This PR bumps up the Cosmos.Direct package version to 3.37.3 which is released on top of EN20241007 release. In a nut-shell, the change is focused on cleaning up the unhealthy dangling connections, that were created as a part of the "Advanced Replica Selection" flow.

Problem: One or our recent conversations with the IC3 team has helped us unveiling a potential issue with the RNTBD connection creation and management flow in the LoadBalancingPartition. The team is using the .NET SDK version 3.39.0-preview to connect to cosmos db. Recently they encountered a potential memory leak in some of their clusters and upon investigation, it appeared that one of the root cause is the underlying CosmosClient is keeping a high number of unhealthy and unused LbChannelStates.

In a nut shell, below are few of the account configurations and facts:

account-1: 9 Partitions with 1 unique tenant. There are approx 4 to 8 clients for this tenant. 2 * no of replica regions is the client count. They have the connection warm-up enabled on this account.

account-2 2592 Partitions with 249 tenants/ feds. Connections created in happy path scenario: 249 x Y (Y = number of active clients for that account). Connection warm-up disabled on this account.

account-3: 27 Partitions with 13 tenants/ feds. CreateAndInitialize They have the connection warm-up enabled on this account.

To understand this in more detail, please take a look at the memory dump below.

ic3_memory_dump_accounts_hidden

[Fig-1: The above figure shows a snapshot of the memory dump taken for multiple accounts This also unviels the potential memory leak by the unhealthy connection.]

Upon further analysis of the memory dump, it is clear that:

  • The number of stale unhealthy connections are higher in the accounts where we have the replica validation enabled along with the connection warm-up.

  • Without the connection warm-up, the number of stale unhealthy connections are comparatively lower, but still good enough to increase the memory foot-print.

    ic3_new_sdk_high_memory_accounts_hidden

    [Fig-2: The above figure shows how the memory footprint got increased over time, along with incoming requests. The service was finally needed to be restarted to free up the memory]

  • Even without the replica validation feature, the memory footprint showed a consistent increase over time.

    ic3_old_sdk_high_memory_accounts_hidden

    [Fig-3: The above figure shows the memory consumption from the IC3 partner-api service which is using an older version (v 3.25.0) of the .NET SDK and the memory consumption kept increasing with time.]

Analysis : Upon further digging in to the memory dump, and re-producing the scenario locally, it was noted that:

  • With Replica Validation Enabled: Each of the impacted LoadBalancingPartition was holding more than 1 Unhealthy stale LbChannelState (which is a wrapper around the Dispatcher and a Channel), when the connection to the backend replica was closed deterministically.

  • With Replica Validation Disabled: Each of the impacted LoadBalancingPartition was holding exactly 1 Unhealthy stale LbChannelState (which is a wrapper around the Dispatcher and a Channel), when the connection to the backend replica was closed deterministically.

Let's take a look at the below diagram to understand this in more detail:

image

[Fig-4: The above figure shows an instance of the LoadBalancingPartition holding more than one entry of unhealthy LbChannelState.]

By looking at the above memory dump snapshot, it is clear that these stale LbChannelState entries are kept in the LoadBalancingPartition, until they are removed the openChannels list, which is responsible for maintaining the number of channels (healthy or unhealthy) for that particular endpoint. If they are not cleaned up proactively (which is exactly this case), it might end up claiming extra memory overhead. With increasing number of partitions, connections over the time, things get worse with all these unused, yet low hanging LbChannelStates claiming more and more memory, and causing it to a memory leak. This is the potential root cause of the increased memory consumption.

Proposed Solution :

There are few changes proposed to fix this scenario. These are discussed briefly in the below section:

  • During the replica validation phase, in the OpenConnectionAsync(), proactively remove all the Unhealthy connections from the openChannels within the LoadBalancingPartition. This guarantees that any unhealthy LbChannelStates will be removed from the LoadBalancingPartition, freeing up the additional memory.

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

Closing issues

To automatically close an issue: closes #4467

@kundadebdatta kundadebdatta changed the title [Internal] Direct Package Upgrade: Refactors Code to Bump Up Cosmos.Direct Package to 3.37.3 Upgrade Resiliency: Fixes Code to Clean-up Unhealthy Connection and LbChannelState Object. Dec 10, 2024
@kundadebdatta kundadebdatta marked this pull request as ready for review December 11, 2024 12:40
@kundadebdatta kundadebdatta self-assigned this Dec 11, 2024
@kundadebdatta kundadebdatta added the auto-merge Enables automation to merge PRs label Dec 11, 2024
@microsoft-github-policy-service microsoft-github-policy-service bot merged commit 1c070d3 into master Dec 11, 2024
24 checks passed
@microsoft-github-policy-service microsoft-github-policy-service bot deleted the users/kundadebdatta/upgrade_direct_package_to_3_37_3 branch December 11, 2024 18:51
kundadebdatta added a commit that referenced this pull request Dec 23, 2024
…bChannelState Object. (#4928)

This PR bumps up the `Cosmos.Direct` package version to `3.37.3` which
is released on top of `EN20241007` release. In a nut-shell, the change
is focused on cleaning up the unhealthy dangling connections, that were
created as a part of the "Advanced Replica Selection" flow.

**Problem:** One or our recent conversations with the IC3 team has
helped us unveiling a potential issue with the RNTBD connection creation
and management flow in the `LoadBalancingPartition`. The team is using
the .NET SDK version `3.39.0-preview` to connect to cosmos db. Recently
they encountered a potential memory leak in some of their clusters and
upon investigation, it appeared that one of the root cause is the
underlying `CosmosClient` is keeping a high number of unhealthy and
unused `LbChannelState`s.

In a nut shell, below are few of the account configurations and facts:

• **account-1:** 9 Partitions with 1 unique tenant. There are approx 4
to 8 clients for this tenant. 2 * no of replica regions is the client
count. They have the connection warm-up enabled on this account.

• **account-2** 2592 Partitions with 249 tenants/ feds. Connections
created in happy path scenario: 249 x Y (Y = number of active clients
for that account). Connection warm-up disabled on this account.

• **account-3:** 27 Partitions with 13 tenants/ feds.
CreateAndInitialize They have the connection warm-up enabled on this
account.

To understand this in more detail, please take a look at the memory dump
below.

![ic3_memory_dump_accounts_hidden](/~https://github.com/Azure/azure-cosmos-dotnet-v3/assets/87335885/b78585de-5318-436a-8511-0107d570e8d7)

_[Fig-1: The above figure shows a snapshot of the memory dump taken for
multiple accounts This also unviels the potential memory leak by the
unhealthy connection.]_

Upon further analysis of the memory dump, it is clear that:

- The number of stale unhealthy connections are higher in the accounts
where we have the replica validation enabled along with the connection
warm-up.
- Without the connection warm-up, the number of stale unhealthy
connections are comparatively lower, but still good enough to increase
the memory foot-print.

![ic3_new_sdk_high_memory_accounts_hidden](/~https://github.com/Azure/azure-cosmos-dotnet-v3/assets/87335885/ea3a4f82-daca-40f8-bcd6-d29da806a3c8)

_[Fig-2: The above figure shows how the memory footprint got increased
over time, along with incoming requests. The service was finally needed
to be restarted to free up the memory]_
- Even without the replica validation feature, the memory footprint
showed a consistent increase over time.

![ic3_old_sdk_high_memory_accounts_hidden](/~https://github.com/Azure/azure-cosmos-dotnet-v3/assets/87335885/471a50c3-1786-4f12-b5e3-1ab192c6fef5)

_[Fig-3: The above figure shows the memory consumption from the IC3
partner-api service which is using an older version (v 3.25.0) of the
.NET SDK and the memory consumption kept increasing with time.]_

**Analysis :** Upon further digging in to the memory dump, and
re-producing the scenario locally, it was noted that:

- **With Replica Validation Enabled:** Each of the impacted
`LoadBalancingPartition` was holding more than `1` Unhealthy stale
`LbChannelState` (which is a wrapper around the `Dispatcher` and a
`Channel`), when the connection to the backend replica was closed
deterministically.

- **With Replica Validation Disabled:** Each of the impacted
`LoadBalancingPartition` was holding exactly `1` Unhealthy stale
`LbChannelState` (which is a wrapper around the `Dispatcher` and a
`Channel`), when the connection to the backend replica was closed
deterministically.

Let's take a look at the below diagram to understand this in more
detail:

![image](/~https://github.com/Azure/azure-cosmos-dotnet-v3/assets/87335885/f4e2d494-962b-41da-ae80-0d542d62accf)

_[Fig-4: The above figure shows an instance of the
`LoadBalancingPartition` holding more than one entry of unhealthy
`LbChannelState`.]_

By looking at the above memory dump snapshot, it is clear that these
stale `LbChannelState` entries are kept in the `LoadBalancingPartition`,
until they are removed the `openChannels` list, which is responsible for
maintaining the number of channels (healthy or unhealthy) for that
particular endpoint. If they are not cleaned up proactively (which is
exactly this case), it might end up claiming extra memory overhead. With
increasing number of partitions, connections over the time, things get
worse with all these unused, yet low hanging `LbChannelState`s claiming
more and more memory, and causing it to a memory leak. This is the
potential root cause of the increased memory consumption.

**Proposed Solution :**

There are few changes proposed to fix this scenario. These are discussed
briefly in the below section:

- During the replica validation phase, in the `OpenConnectionAsync()`,
proactively remove all the `Unhealthy` connections from the
`openChannels` within the `LoadBalancingPartition`. This guarantees that
any unhealthy `LbChannelStates` will be removed from the
`LoadBalancingPartition`, freeing up the additional memory.

Please delete options that are not relevant.

- [x] Bug fix (non-breaking change which fixes an issue)

To automatically close an issue: closes #4467
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-merge Enables automation to merge PRs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

LoadBalancingPartition: Clean-Up Unhealthy Dangling Connections.
3 participants